Introduction
Airline passenger satisfaction is a crucial metric for firms in the airline industry. Understanding the factors that contribute to customer satisfaction is essential for airlines to improve their services and compete effectively; high market saturation, as well as low profit margins, can magnify the effects of small advantages or disadvantages relative to other firms (Lutz et al., 2012; Hardee, 2023). In this research, we will analyze various factors that affect airline passenger satisfaction — provided through a survey dataset — and, ultimately, judge their suitability for a regression model predicting passenger satisfaction.
Research Proposal
Our research will first look at individual variables in the aforementioned survey dataset to examine distributions and other characteristics. Then, we will identify a regression model that may be congruent with our dataset and test assumptions associated with the model.
We will leverage a Kaggle dataset that includes surveyed passenger characteristics, flight details, and satisfaction ratings for select pre-flight and in-flight components (Klein, 2020). To ensure modeling suitability, we will conduct exploratory data analysis, taking into account variable distributions and types.
SMART Questions
With our research, we aim to make progress towards answering the following questions:
To what extent do certain surveyed passenger characteristics and flight experience components impact the likelihood that a passenger will be satisfied – rather than neutral or dissatisfied – with their trip?
How can we model the likelihood of passenger satisfaction using surveyed passenger characteristics and flight experience components in a manner that minimizes predictive bias?
To what extent can we predict the likelihood that a flight passenger will be satisfied with their experience using multiple different variable levels?
Objective
This research offers an opportunity to assess the limitations of linear regression models in predicting passenger satisfaction, specifically with regard to the categorical nature of the output in this dataset. Through exploratory data analysis (EDA), we can identify the characteristics of our data and subsequently illustrate why a linear regression model may not be suitable for this analysis. This will lay the groundwork for our future research on logistic regression.
In summary, our research will provide insights into the intricate relationship between passenger characteristics, flight experience, and satisfaction levels. We will also explore the limitations of linear regression models and prepare the foundation for a more advanced logistic regression approach in future analysis.
Dataset Variables
The dataset for our research on airline passenger satisfaction contains various variables, which can be categorized into three types: continuous, categorical, and ordinal. In this section, we’ll list and briefly explain each of these variables.
Continuous Variables
Age: This variable represents the actual age of the passengers.
Flight Distance: Flight distance is the distance covered during the journey, measured in miles.
Departure Delay in Minutes: This variable indicates the number of minutes by which a flight was delayed during departure.
Arrival Delay in Minutes: Similarly, this variable represents the number of minutes by which a flight was delayed during arrival.
Categorical Variables
Gender: Gender is a categorical variable indicating the gender of the passengers.
Customer Type: The “Customer Type” variable categorizes passengers based on their customer loyalty.
Type of Travel: This variable categorizes the purpose of the flight.
Class: “Class” indicates the travel class in the plane.
Ordinal Variables
The following variables represent satisfaction levels, which are ordinal in nature, with values ranging from 0 to 5. According to the documentation, 0 is used to encode “Not Applicable” values.
Inflight Wifi Service: Satisfaction level of the inflight wifi service.
Departure/Arrival Time Convenient: Satisfaction level of departure/arrival time convenience.
Ease of Online Booking: Satisfaction level of online booking.
Gate Location: Satisfaction level of gate location.
Food and Drink: Satisfaction level of food and drink.
Online Boarding: Satisfaction level of online boarding.
Seat Comfort: Satisfaction level of seat comfort.
Inflight Entertainment: Satisfaction level of inflight entertainment.
On-board Service: Satisfaction level of on-board service.
Leg Room Service: Satisfaction level of leg room service.
Baggage Handling: Satisfaction level of baggage handling.
Check-in Service: Satisfaction level of check-in service.
Inflight Service: Satisfaction level of inflight service.
Cleanliness: Satisfaction level of cleanliness.
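The "0 means Not Applicable" convention can be made explicit when a rating is treated as ordinal. The following is a minimal base R sketch, assuming a made-up vector `wifi_raw` rather than values from the dataset:

```r
# Hypothetical rating responses; 0 encodes "Not Applicable" per the documentation.
wifi_raw <- c(3, 0, 5, 1, 4, 0, 2)

# Listing only levels 1-5 turns every 0 into NA; ordered = TRUE keeps the ranking.
wifi_rating <- factor(wifi_raw, levels = 1:5, ordered = TRUE)

print(wifi_rating)
print(sum(is.na(wifi_rating)))  # number of "Not Applicable" responses
```

Whether to keep the ratings as ordered factors or as integers is a modeling choice; this sketch only shows one way to keep the zeros from being read as a sixth, lowest rating.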
Target Variable
- Satisfaction: The “Satisfaction” variable represents the airline passenger’s satisfaction level and includes two categories: “satisfied” or “neutral or dissatisfied.” This will be our primary outcome variable for analysis.
In our research, we will explore how these variables interact and contribute to passenger satisfaction levels. We will use statistical methods and modeling techniques to gain insights into the factors that lead to customer satisfaction for an airline.
Variable limitations
While the analysis and insight generation opportunities are manifold, certain fields in this dataset can present challenges limiting a resulting model’s predictive validity. These include:
Data collection: this dataset was sourced from Kaggle (Klein, 2020). While some variable-related documentation is available, we are not able to discern the circumstances under which this survey was distributed. The population may have been sampled through certain methods—such as convenience sampling—that make resulting data less representative of the overall population despite the large observation count. The overall population in question also is not clear; the survey may have focused on a particular airport or region, limiting potential predictive validity in alternative settings.
Loyal/disloyal clarity: the documentation does not elaborate upon what counts as a “loyal” or “disloyal” customer for that field. This makes it difficult to properly interpret the effects of such a variable in a regression model. The threshold for disloyalty could potentially range from using any other airlines at all to using other airlines a majority of the time, drastically altering any potential real-world applications.
Ticket prices: ticket prices are not included in this survey, with class serving as a rough proxy; intuitively, such prices could play a major factor in passengers’ service expectations and their subsequent ratings. The lack of price ranges associated with seat class also makes it difficult to encode the three categories in a way that accurately captures the disparity.
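One way to sidestep the unknown spacing between seat classes is to treat Class as nominal and let each level have its own indicator column. This is a hedged base R sketch on a hypothetical five-row `travel` data frame; the choice of "Eco" as the reference level is an assumption made for illustration:

```r
# Hypothetical sample of the Class column.
travel <- data.frame(
  Class = c("Eco", "Business", "Eco Plus", "Business", "Eco")
)

# Treat Class as nominal; "Eco" is the reference level here (an assumption).
travel$Class <- factor(travel$Class, levels = c("Eco", "Eco Plus", "Business"))

# model.matrix() expands the factor into 0/1 indicator columns.
dummies <- model.matrix(~ Class, data = travel)
print(dummies)
```

With indicator columns, a model estimates a separate effect for each class instead of assuming that the steps from Eco to Eco Plus to Business are equally sized.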
Loading the Data
We first imported the data into R using the read.csv() function. The first few rows in the dataset are included below.
| X | id | Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied |
| 1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied |
| 2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied |
| 3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied |
| 4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied |
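The import step can be sketched as follows. The file name and location are not given in the text, so this example writes a tiny stand-in CSV to a temporary file; the header layout (a blank first header and names containing spaces) is an assumption chosen to show how read.csv() arrives at names such as `X` and `Customer.Type`:

```r
# Write a tiny stand-in file; the real file name and location are not given in the text.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",id,Gender,Customer Type,Age,satisfaction",
  "0,70172,Male,Loyal Customer,13,neutral or dissatisfied",
  "1,5047,Male,disloyal Customer,25,neutral or dissatisfied",
  "2,110028,Female,Loyal Customer,26,satisfied"
), csv_path)

# read.csv() makes header names syntactic: the blank first header becomes "X"
# and spaces become dots.
data <- read.csv(csv_path)

print(names(data))
print(head(data))
```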
Checking data structure and dimensions
Data structure
## 'data.frame': 103904 obs. of 25 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 13 25 26 25 61 26 47 52 41 20 ...
## $ Type.of.Travel : chr "Personal Travel" "Business travel" "Business travel" "Business travel" ...
## $ Class : chr "Eco Plus" "Business" "Business" "Business" ...
## $ Flight.Distance : int 460 235 1142 562 214 1180 1276 2035 853 1061 ...
## $ Inflight.wifi.service : int 3 3 2 2 3 3 2 4 1 3 ...
## $ Departure.Arrival.time.convenient: int 4 2 2 5 3 4 4 3 2 3 ...
## $ Ease.of.Online.booking : int 3 3 2 5 3 2 2 4 2 3 ...
## $ Gate.location : int 1 3 2 5 3 1 3 4 2 4 ...
## $ Food.and.drink : int 5 1 5 2 4 1 2 5 4 2 ...
## $ Online.boarding : int 3 3 5 2 5 2 2 5 3 3 ...
## $ Seat.comfort : int 5 1 5 2 5 1 2 5 3 3 ...
## $ Inflight.entertainment : int 5 1 5 2 3 1 2 5 1 2 ...
## $ On.board.service : int 4 1 4 2 3 3 3 5 1 2 ...
## $ Leg.room.service : int 3 5 3 5 4 4 3 5 2 3 ...
## $ Baggage.handling : int 4 3 4 3 4 4 4 5 1 4 ...
## $ Checkin.service : int 4 1 4 1 3 4 3 4 4 4 ...
## $ Inflight.service : int 5 4 4 4 3 4 5 5 1 3 ...
## $ Cleanliness : int 5 1 5 2 3 1 2 4 2 2 ...
## $ Departure.Delay.in.Minutes : int 25 1 0 11 0 0 9 4 0 0 ...
## $ Arrival.Delay.in.Minutes : num 18 6 0 9 0 0 23 0 0 0 ...
## $ satisfaction : chr "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
X and id: These columns represent some unique identifiers for each observation. X appears to be an integer index, while id is also an integer and likely represents a customer ID or some form of identifier.
Gender: This column contains information about the gender of the passengers, with values such as “Male” and “Female.”
Customer.Type: This variable describes the customer as a “Loyal Customer” or a “disloyal Customer.”
Age: Represents the age of the passengers and is an integer variable.
Type.of.Travel: Indicates the purpose of travel with two levels, “Personal Travel” and “Business travel.”
Class: Specifies the class of travel with three levels: “Business,” “Eco,” and “Eco Plus.”
Flight.Distance: This variable contains the distance of the flight in miles as an integer.
Inflight.wifi.service, Departure.Arrival.time.convenient, and several other columns: These variables seem to represent passengers’ ratings or feedback on different aspects of their flight experience. They are integer variables with ratings ranging from 0 to 5.
Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes: These columns represent the delay in minutes for departure and arrival, respectively. Departure delay is an integer, while arrival delay is a numeric (double) variable; this runs against our initial expectation that both delay columns would share a type. The structure output alone cannot confirm the cause, but a likely one is that arrival delays are recorded with decimal values in the source file while departure delays are not.
satisfaction: This is the target variable or the outcome of interest, and it represents customer satisfaction levels with values like “neutral or dissatisfied” and “satisfied.”
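The integer versus numeric difference between the two delay columns can be reproduced on a small example. This sketch assumes, without confirmation from the source file, that arrival delays are written with a decimal point; the four rows are hypothetical:

```r
# Hypothetical excerpt: arrival delays written with a decimal point, one value missing.
csv_text <- "Departure.Delay.in.Minutes,Arrival.Delay.in.Minutes
25,18.0
1,6.0
0,
11,9.0"

delays <- read.csv(text = csv_text)

# Whole numbers without a decimal point parse as integer; "18.0" parses as numeric.
print(sapply(delays, class))
print(colSums(is.na(delays)))
```

Checking `sapply(data, class)` on the full data frame is a quick way to spot columns whose type differs from what the variable description suggests.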
Data dimensions
This is a data frame with 103904 observations (rows) and 25 variables (columns). Assuming that a robust sampling method was utilized, the large number of observations may allow us to conclude that the data is generally representative of the actual population.
An initial description of the data
## data
##
## 25 Variables 103904 Observations
## ------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 51952 34635
## .05 .10 .25 .50 .75 .90
## 5195 10390 25976 51952 77927 93513
## .95
## 98708
##
## lowest : 0 1 2 3 4
## highest: 103899 103900 103901 103902 103903
## ------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 64924 43260
## .05 .10 .25 .50 .75 .90
## 6593 13044 32534 64857 97368 116884
## .95
## 123410
##
## lowest : 1 2 3 4 5
## highest: 129874 129875 129878 129879 129880
## ------------------------------------------------------------
## Gender
## n missing distinct
## 103904 0 2
##
## Value Female Male
## Frequency 52727 51177
## Proportion 0.507 0.493
## ------------------------------------------------------------
## Customer.Type
## n missing distinct
## 103904 0 2
##
## Value disloyal Customer Loyal Customer
## Frequency 18981 84923
## Proportion 0.183 0.817
## ------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd
## 103904 0 75 1 39.38 17.32
## .05 .10 .25 .50 .75 .90
## 14 20 27 40 51 59
## .95
## 64
##
## lowest : 7 8 9 10 11, highest: 77 78 79 80 85
## ------------------------------------------------------------
## Type.of.Travel
## n missing distinct
## 103904 0 2
##
## Value Business travel Personal Travel
## Frequency 71655 32249
## Proportion 0.69 0.31
## ------------------------------------------------------------
## Class
## n missing distinct
## 103904 0 3
##
## Value Business Eco Eco Plus
## Frequency 49665 46745 7494
## Proportion 0.478 0.450 0.072
## ------------------------------------------------------------
## Flight.Distance
## n missing distinct Info Mean Gmd
## 103904 0 3802 1 1189 1066
## .05 .10 .25 .50 .75 .90
## 175 236 414 843 1743 2750
## .95
## 3383
##
## lowest : 31 56 67 73 74, highest: 4243 4502 4817 4963 4983
## ------------------------------------------------------------
## Inflight.wifi.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 2.73 1.492
##
## Value 0 1 2 3 4 5
## Frequency 3103 17840 25830 25868 19794 11469
## Proportion 0.030 0.172 0.249 0.249 0.191 0.110
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Arrival.time.convenient
## n missing distinct Info Mean Gmd
## 103904 0 6 0.962 3.06 1.716
##
## Value 0 1 2 3 4 5
## Frequency 5300 15498 17191 17966 25546 22403
## Proportion 0.051 0.149 0.165 0.173 0.246 0.216
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Ease.of.Online.booking
## n missing distinct Info Mean Gmd
## 103904 0 6 0.961 2.757 1.578
##
## Value 0 1 2 3 4 5
## Frequency 4487 17525 24021 24449 19571 13851
## Proportion 0.043 0.169 0.231 0.235 0.188 0.133
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Gate.location
## n missing distinct Info Mean Gmd
## 103904 0 6 0.952 2.977 1.437
##
## Value 0 1 2 3 4 5
## Frequency 1 17562 19459 28577 24426 13879
## Proportion 0.000 0.169 0.187 0.275 0.235 0.134
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Food.and.drink
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 3.202 1.499
##
## Value 0 1 2 3 4 5
## Frequency 107 12837 21988 22300 24359 22313
## Proportion 0.001 0.124 0.212 0.215 0.234 0.215
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Online.boarding
## n missing distinct Info Mean Gmd
## 103904 0 6 0.951 3.25 1.501
##
## Value 0 1 2 3 4 5
## Frequency 2428 10692 17505 21804 30762 20713
## Proportion 0.023 0.103 0.168 0.210 0.296 0.199
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Seat.comfort
## n missing distinct Info Mean Gmd
## 103904 0 6 0.945 3.439 1.462
##
## Value 0 1 2 3 4 5
## Frequency 1 12075 14897 18696 31765 26470
## Proportion 0.000 0.116 0.143 0.180 0.306 0.255
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.entertainment
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.358 1.49
##
## Value 0 1 2 3 4 5
## Frequency 14 12478 17637 19139 29423 25213
## Proportion 0.000 0.120 0.170 0.184 0.283 0.243
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## On.board.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.947 3.382 1.433
##
## Value 0 1 2 3 4 5
## Frequency 3 11872 14681 22833 30867 23648
## Proportion 0.000 0.114 0.141 0.220 0.297 0.228
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Leg.room.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.351 1.471
##
## Value 0 1 2 3 4 5
## Frequency 472 10353 19525 20098 28789 24667
## Proportion 0.005 0.100 0.188 0.193 0.277 0.237
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Baggage.handling
## n missing distinct Info Mean Gmd
## 103904 0 5 0.926 3.632 1.282
##
## Value 1 2 3 4 5
## Frequency 7237 11521 20632 37383 27131
## Proportion 0.070 0.111 0.199 0.360 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Checkin.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.946 3.304 1.408
##
## Value 0 1 2 3 4 5
## Frequency 1 12890 12893 28446 29055 20619
## Proportion 0.000 0.124 0.124 0.274 0.280 0.198
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.924 3.64 1.274
##
## Value 0 1 2 3 4 5
## Frequency 3 7084 11457 20299 37945 27116
## Proportion 0.000 0.068 0.110 0.195 0.365 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Cleanliness
## n missing distinct Info Mean Gmd
## 103904 0 6 0.953 3.286 1.471
##
## Value 0 1 2 3 4 5
## Frequency 12 13318 16132 24574 27179 22689
## Proportion 0.000 0.128 0.155 0.237 0.262 0.218
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103904 0 446 0.82 14.82 24.68
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 12 44
## .95
## 78
##
## lowest : 0 1 2 3 4, highest: 933 978 1017 1305 1592
## ------------------------------------------------------------
## Arrival.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103594 310 455 0.823 15.18 25.15
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 13 44
## .95
## 79
##
## lowest : 0 1 2 3 4, highest: 952 970 1011 1280 1584
## ------------------------------------------------------------
## satisfaction
## n missing distinct
## 103904 0 2
##
## Value neutral or dissatisfied satisfied
## Frequency 58879 45025
## Proportion 0.567 0.433
## ------------------------------------------------------------
- Variable X and ID:
- Variable ‘X’ is an integer index ranging from 0 to 103903 with no missing values.
- Variable ‘id’ represents customer IDs and is also an integer, ranging from 1 to 129880 with no missing values.
- Gender:
- There are two distinct values, ‘Female’ and ‘Male,’ with roughly equal proportions of female (50.7%) and male (49.3%) passengers.
- Customer Type:
- Two distinct types of customers are present: ‘disloyal Customer’ and ‘Loyal Customer.’ ‘Loyal Customer’ is the dominant type, accounting for approximately 81.7% of passengers.
- Age:
- The age variable ranges from 7 to 85 with a mean age of approximately 39.38. 50% of the respondents’ ages fall between 27 and 51.
- Type of Travel:
- There are two types of travel: ‘Business travel’ (69.0%) and ‘Personal Travel’ (31.0%). Business travel is the more common type by far.
- Class:
- Three distinct classes are available: ‘Business,’ ‘Eco,’ and ‘Eco Plus.’
- ‘Business’ class is the most popular (47.8%), followed by ‘Eco’ (45.0%) and ‘Eco Plus’ (7.2%).
- Flight Distance:
- The mean flight distance is approximately 1189 miles, with values ranging from 31 to 4983 miles; the middle 90% of flights fall between 175 and 3383 miles.
- Inflight Wifi Service, Departure Arrival Time Convenient, Ease of Online Booking, Gate Location, Food and Drink, Online Boarding, Seat Comfort, Inflight Entertainment, On-Board Service, Legroom Service, Baggage Handling, Check-In Service, Inflight Service, and Cleanliness:
- These variables represent passengers’ ratings on a scale from 0 to 5 for various aspects of their flight experience.
- The mean ratings for each of these variables fall between 2.73 and 3.64.
- 4 appears to be the most commonly selected option for most individual ratings.
- Departure Delay in Minutes:
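- At least half of flights have no departure delay (the median is 0), although the mean delay is 14.82 minutes.
- Delays range from 0 to 1592 minutes; 95% of flights have a departure delay of 78 minutes or less.
- Arrival Delay in Minutes:
- Arrival delays are similar to departure delays: at least half of flights have no delay (the median is 0), with a mean delay of 15.18 minutes. This field has 310 missing values.
- Delays range from 0 to 1584 minutes; 95% of recorded arrival delays are 79 minutes or less.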
- Satisfaction:
- There are two categories of satisfaction: ‘neutral or dissatisfied’ (56.7%) and ‘satisfied’ (43.3%).
- Overall, more passengers appear to be ‘neutral or dissatisfied’ with their flight experience.
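For skewed fields such as the delays, the full range and the 5th to 95th percentile span can differ widely, so it helps to compute both. A minimal base R sketch on a made-up, right-skewed `delay` vector (not the survey data):

```r
# Hypothetical delay values (minutes), heavily right-skewed like the delay columns.
delay <- c(rep(0, 60), 1:30, 45, 60, 78, 120, 240, 400, 600, 900, 1200, 1592)

print(range(delay))                                  # minimum and maximum
print(quantile(delay, probs = c(0.05, 0.5, 0.95)))   # 5th, 50th, 95th percentiles
print(mean(delay))                                   # pulled upward by the long tail
```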
Data Pre-processing
Duplicate values
The dataset contains no duplicate rows.
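A duplicate check of this kind can be sketched in base R as follows. The `survey` data frame is hypothetical and deliberately contains one repeated row so that the check has something to find:

```r
# Hypothetical rows; the second and fourth are identical on every column.
survey <- data.frame(
  Gender = c("Male", "Female", "Male", "Female"),
  Age = c(13L, 26L, 61L, 26L),
  satisfaction = c("satisfied", "satisfied", "neutral or dissatisfied", "satisfied")
)

# duplicated() flags each row that repeats an earlier row.
n_duplicates <- sum(duplicated(survey))
print(n_duplicates)

# Keep only the first occurrence of each row.
survey_unique <- survey[!duplicated(survey), ]
print(nrow(survey_unique))
```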
Missing Values
The following table shows the NA values in our dataset:
| X | id | Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 310 | 0 |
We elected to replace these 310 NA values in arrival delays with the median delay; this method was used over other potential replacement options, such as the average, due to the skewed distribution of values detailed later on.
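The reasoning for preferring the median can be illustrated on a small example. This is a minimal sketch with a made-up `arrival_delay` vector; a single extreme value is enough to pull the mean far above the typical delay:

```r
# Hypothetical arrival delays (minutes) with missing values.
arrival_delay <- c(0, 0, 0, 5, 13, 250, NA, NA)

# The median is barely affected by the extreme value that inflates the mean.
delay_median <- median(arrival_delay, na.rm = TRUE)
delay_mean <- mean(arrival_delay, na.rm = TRUE)

# Replace each NA with the median of the observed values.
arrival_imputed <- arrival_delay
arrival_imputed[is.na(arrival_imputed)] <- delay_median

print(c(median = delay_median, mean = delay_mean))
print(arrival_imputed)
```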
Responses for the ratings variables are coded as values from 1 to 5. However, some responses include 0; as noted earlier, this indicates that the question was not applicable. Respondents who selected this option for any of the ratings variables are filtered out to ensure that all of the individual ratings are relevant for all observations. While alternatives such as replacement exist, the large number of initial observations limited our concerns over a potential loss in predictive validity.
The table below demonstrates that all missing values have been replaced; the “X” and “id” fields for index number and survey ID are also removed from the data frame due to their limited relevance for modeling.
| Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
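The column removal and the filtering of "Not Applicable" ratings described in this section can be sketched as follows. The four-row `survey` data frame and the two-column `rating_cols` vector are hypothetical stand-ins; the full data has fourteen rating columns:

```r
# Hypothetical miniature of the survey data.
survey <- data.frame(
  X = 0:3,
  id = c(70172L, 5047L, 110028L, 24026L),
  Inflight.wifi.service = c(3L, 0L, 2L, 5L),
  Cleanliness = c(5L, 1L, 0L, 2L),
  satisfaction = c("satisfied", "satisfied", "neutral or dissatisfied", "satisfied")
)

# Drop the index and identifier columns.
survey <- survey[, !(names(survey) %in% c("X", "id"))]

# Keep rows where every rating column is non-zero (0 = "Not Applicable").
rating_cols <- c("Inflight.wifi.service", "Cleanliness")
keep <- rowSums(survey[, rating_cols] == 0) == 0
survey <- survey[keep, ]

print(survey)
print(colSums(is.na(survey)))  # per-column NA counts
```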
Summary Statistics
The following output features summary statistics for the continuous variables:
summary_stats_numeric
## Age Flight.Distance Departure.Delay.in.Minutes
## Min. : 7.0 Min. : 31 Min. : 0
## 1st Qu.:28.0 1st Qu.: 438 1st Qu.: 0
## Median :40.0 Median : 867 Median : 0
## Mean :39.8 Mean :1222 Mean : 15
## 3rd Qu.:51.0 3rd Qu.:1773 3rd Qu.: 13
## Max. :85.0 Max. :4983 Max. :1592
## Arrival.Delay.in.Minutes
## Min. : 0
## 1st Qu.: 0
## Median : 0
## Mean : 15
## 3rd Qu.: 13
## Max. :1584
The following output features summary statistics for the categorical/ordinal variables:
summary_stats_categorical
## Gender_n Gender_n_distinct Gender_top_freq
## 1 95704 2 Female
## Customer.Type_n Customer.Type_n_distinct
## 1 95704 2
## Customer.Type_top_freq Type.of.Travel_n
## 1 Loyal Customer 95704
## Type.of.Travel_n_distinct Type.of.Travel_top_freq Class_n
## 1 2 Business travel 95704
## Class_n_distinct Class_top_freq Inflight.wifi.service_n
## 1 3 Business 95704
## Inflight.wifi.service_n_distinct
## 1 5
## Inflight.wifi.service_top_freq
## 1 3
## Departure.Arrival.time.convenient_n
## 1 95704
## Departure.Arrival.time.convenient_n_distinct
## 1 5
## Departure.Arrival.time.convenient_top_freq
## 1 4
## Ease.of.Online.booking_n
## 1 95704
## Ease.of.Online.booking_n_distinct
## 1 5
## Ease.of.Online.booking_top_freq Gate.location_n
## 1 3 95704
## Gate.location_n_distinct Gate.location_top_freq
## 1 5 3
## Food.and.drink_n Food.and.drink_n_distinct
## 1 95704 5
## Food.and.drink_top_freq Online.boarding_n
## 1 4 95704
## Online.boarding_n_distinct Online.boarding_top_freq
## 1 5 4
## Seat.comfort_n Seat.comfort_n_distinct
## 1 95704 5
## Seat.comfort_top_freq Inflight.entertainment_n
## 1 4 95704
## Inflight.entertainment_n_distinct
## 1 5
## Inflight.entertainment_top_freq On.board.service_n
## 1 4 95704
## On.board.service_n_distinct On.board.service_top_freq
## 1 5 4
## Leg.room.service_n Leg.room.service_n_distinct
## 1 95704 5
## Leg.room.service_top_freq Baggage.handling_n
## 1 4 95704
## Baggage.handling_n_distinct Baggage.handling_top_freq
## 1 5 4
## Checkin.service_n Checkin.service_n_distinct
## 1 95704 5
## Checkin.service_top_freq Inflight.service_n
## 1 4 95704
## Inflight.service_n_distinct Inflight.service_top_freq
## 1 5 4
## Cleanliness_n Cleanliness_n_distinct Cleanliness_top_freq
## 1 95704 5 4
## satisfaction_n satisfaction_n_distinct
## 1 95704 2
## satisfaction_top_freq
## 1 neutral or dissatisfied
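A summary of this shape (count, number of distinct values, most frequent value) can be produced in base R. This sketch uses a hypothetical helper, `summarise_categorical()`, on a made-up data frame; it is not the code that generated the output above:

```r
# Hypothetical categorical columns.
survey <- data.frame(
  Class = c("Business", "Eco", "Business", "Eco Plus", "Business"),
  Gender = c("Female", "Male", "Female", "Female", "Male")
)

# For one column: observation count, distinct values, most frequent value.
summarise_categorical <- function(x) {
  counts <- table(x)
  data.frame(
    n = length(x),
    n_distinct = length(counts),
    top_freq = names(counts)[which.max(counts)]
  )
}

# One summary row per column.
summary_stats <- do.call(rbind, lapply(survey, summarise_categorical))
print(summary_stats)
```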
Examining variable distributions
Frequency distributions for categorical variables
The plots above provide visual representations for the summary statistics detailed earlier. While none initially appear to be highly correlated, we intend to confirm this using variance inflation factor (VIF) analysis at a later time once our model is fleshed out (“vif: Variance Inflation Factors”, n.d.).
Provided that the sampling method was robust, these distributions (including the highly skewed ones) can reasonably be treated as representative of the overall population.
Looking at the distribution of class, Eco Plus has a significantly lower observation frequency than the other two. In addition, as noted earlier, the magnitudes of increments between Eco, Eco Plus, and Business are not clear; some transformation may be required later to ensure modeling suitability.
Frequency distributions for continuous variables
From the graphs above, flight distance as well as both delay variables have a strongly right-skewed distribution. This makes sense intuitively; we would expect most flights to have minimal to no delays, and shorter flights are likely more frequent.
Age is the only variable that somewhat approximates a normal distribution (although that cannot be safely assumed); the current graph appears to be bimodal to a degree, with a small peak around 20-25 and another peak roughly around 35-50.
Depending on the type of regression that is ultimately selected, some of these variables may require aggressive transformations to better approximate normal distributions.
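To make the idea of a transformation concrete, the sketch below measures skewness before and after a log transform. The `skewness()` helper is a simple hand-written estimate and the `delay` vector is made up; `log1p()` is used because the delay fields contain many zeros:

```r
# Simple sample skewness: the average cubed z-score (positive = long right tail).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Hypothetical right-skewed delays in minutes, including many zeros.
delay <- c(rep(0, 50), 1:20, 30, 45, 60, 90, 150, 300, 600)

# log1p(x) = log(1 + x), which is defined at zero.
delay_log <- log1p(delay)

print(skewness(delay))
print(skewness(delay_log))
```

A log transform reduces the skew here but does not remove the spike at zero, so the result is still far from normal.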
Frequency distributions for ordinal variables (Ratings)
Departure/Arrival Time Convenient, Food and Drink, Online Boarding, Seat Comfort, Inflight Entertainment, On-board Service, Leg Room Service, Baggage Handling, Check-in Service, Inflight Service, and Cleanliness all have a mode of 4. Inflight Wifi Service, Gate Location, and Ease of Online Booking all have a mode of 3. Many of the distributions for individual ratings variables look quite similar, raising multicollinearity concerns that will be addressed later.
Distribution of continuous variable features by satisfaction - KDE (Kernel Density Estimation)
Observations
Age: Middle-aged passengers tend to exhibit higher levels of satisfaction compared to both younger and older age groups, peaking around 40-50 years of age. Meanwhile, the distribution of neutral/dissatisfied passengers peaks noticeably earlier. If age is proven to be a significant factor, this could be utilized to engage in age-targeted improvements.
Flight Distance: Passengers traveling shorter distances appear to be more inclined towards neutrality or dissatisfaction compared to those embarking on longer journeys. This insight suggests that there might be unique challenges or aspects of shorter flights that influence passenger contentment and warrant further investigation.
Arrival/Departure Delays: It is difficult to discern any meaningful differences between passengers that were satisfied or neutral/dissatisfied based on arrival or departure delay durations using this method. To expand upon these visuals—potentially revealing more significant observations—we utilized a scatter plot.
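The group-wise density comparison can be sketched with base R's density() function. The ages below are simulated with arbitrary means and spreads purely for illustration; they are not drawn from the survey:

```r
set.seed(1)

# Simulated ages (not the survey data): two groups with different centers.
age_satisfied <- rnorm(500, mean = 45, sd = 8)
age_neutral <- rnorm(500, mean = 30, sd = 8)

# density() computes a Gaussian kernel density estimate by default.
kde_satisfied <- density(age_satisfied)
kde_neutral <- density(age_neutral)

# The x value at which each estimated density is highest.
peak_satisfied <- kde_satisfied$x[which.max(kde_satisfied$y)]
peak_neutral <- kde_neutral$x[which.max(kde_neutral$y)]

print(c(satisfied = peak_satisfied, neutral_or_dissatisfied = peak_neutral))
```

Calling `plot(kde_satisfied)` followed by `lines(kde_neutral, lty = 2)` would overlay the two curves for a visual comparison.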
Visualizing the relationship between Arrival and Departure delays colored by satisfaction.
This graph also indicates that arrival and departure delays follow a roughly similar linear trajectory, potentially foreshadowing high correlation between these fields.
Multicollinearity Testing
One of the essential steps in data analysis is assessing multicollinearity among independent variables. Multicollinearity occurs when predictor variables are highly correlated with each other, which can impact the reliability of regression models.
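A common numeric summary of multicollinearity is the variance inflation factor, computed for predictor j as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors; it is an illustration of the formula, not the packaged `vif` function cited earlier:

```r
set.seed(42)

# Simulated predictors (not the survey data): x2 is nearly a copy of x1.
n <- 200
predictors <- data.frame(x1 = rnorm(n))
predictors$x2 <- predictors$x1 + rnorm(n, sd = 0.1)
predictors$x3 <- rnorm(n)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

Values near 1 indicate little overlap with the other predictors, while large values flag a predictor that the others almost fully explain.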
Correlation Matrices
To begin examining fields with respect to multicollinearity, we used two correlation matrices:
Continuous variables
Ratings variables
Continuous Variable Correlations
## Age Flight.Distance
## Min. :-0.016 Min. :-0.004
## 1st Qu.:-0.014 1st Qu.:-0.001
## Median : 0.035 Median : 0.042
## Mean : 0.264 Mean : 0.270
## 3rd Qu.: 0.312 3rd Qu.: 0.312
## Max. : 1.000 Max. : 1.000
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## Min. :-0.013 Min. :-0.016
## 1st Qu.:-0.003 1st Qu.:-0.007
## Median : 0.480 Median : 0.478
## Mean : 0.487 Mean : 0.485
## 3rd Qu.: 0.970 3rd Qu.: 0.970
## Max. : 1.000 Max. : 1.000
As observed earlier, arrival and departure delays appear to be highly correlated; certain steps, such as removing one of the two or calculating an average delay variable, would likely be necessary for use in a predictive model.
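A correlation matrix for the continuous variables can be sketched with cor(). The values below are the first eight observations shown in the structure output, so the coefficients will not match those from the full dataset; the 0.7 cutoff for flagging pairs is an arbitrary choice for illustration:

```r
# First eight observations of the continuous variables from the structure output.
flights <- data.frame(
  Age = c(13, 25, 26, 25, 61, 26, 47, 52),
  Flight.Distance = c(460, 235, 1142, 562, 214, 1180, 1276, 2035),
  Departure.Delay.in.Minutes = c(25, 1, 0, 11, 0, 0, 9, 4),
  Arrival.Delay.in.Minutes = c(18, 6, 0, 9, 0, 0, 23, 0)
)

# Pearson correlations; rows with missing values are dropped.
cor_matrix <- cor(flights, use = "complete.obs")
print(round(cor_matrix, 3))

# List variable pairs whose absolute correlation exceeds the cutoff.
high <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
print(data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]]
))
```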
Ratings Variable Correlations
Outside of continuous variables, many of the ratings appear to share similar frequency distributions based on the graphs displayed earlier, sparking significant multicollinearity concerns. Our next step to evaluate these potential relationships was to create another correlation matrix.
We can see from the matrix that certain ratings variables have strong positive correlations with each other. If these are included in the model without adjustments, our model may suffer a loss in reliability.
In order to avoid this issue, we elected to combine ratings variables into two groups—based on the degree of correlation—and utilize average ratings from these two groups as model inputs.
| Ratings Group 1: Pre-Flight & Wi-Fi | Ratings Group 2: In-Flight & Baggage |
|---|---|
| In-Flight Wifi Service | Food and Drink |
| Departure / Arrival Time | Seat Comfort |
| Ease of Online Booking | In-Flight Entertainment |
| Gate Location | Onboard Service |
| Online Boarding | Leg Room Service |
| | Baggage Handling |
| | Check-In Service |
| | In-Flight Service |
| | Cleanliness |
## Pre_Flight_and_WiFi_Ratings In_Flight_and_Baggage_Ratings
## Min. :1.00 Min. :1.11
## 1st Qu.:2.40 1st Qu.:2.78
## Median :3.00 Median :3.44
## Mean :3.04 Mean :3.41
## 3rd Qu.:3.80 3rd Qu.:4.00
## Max. :5.00 Max. :5.00
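A sketch of how two averaged rating fields like these could be built with `rowMeans`. The assignment of columns to groups below is an assumption (five items in the first group, nine in the second), and the ratings are synthetic.

```r
set.seed(2)
# Assumed group membership; column names follow the dataset's naming
group1 <- c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
            "Ease.of.Online.booking", "Gate.location", "Online.boarding")
group2 <- c("Food.and.drink", "Seat.comfort", "Inflight.entertainment",
            "On.board.service", "Leg.room.service", "Baggage.handling",
            "Checkin.service", "Inflight.service", "Cleanliness")

# Synthetic 1-5 ratings for 200 hypothetical respondents
ratings <- as.data.frame(
  matrix(sample(1:5, 200 * 14, replace = TRUE), nrow = 200,
         dimnames = list(NULL, c(group1, group2)))
)

# One averaged rating per group and respondent
ratings$Pre_Flight_and_WiFi_Ratings   <- rowMeans(ratings[, group1])
ratings$In_Flight_and_Baggage_Ratings <- rowMeans(ratings[, group2])

summary(ratings[, c("Pre_Flight_and_WiFi_Ratings",
                    "In_Flight_and_Baggage_Ratings")])
```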
Probability and standard OLS estimates
Before engaging in further analysis, we noted that satisfaction, as a binary variable, poses a fundamental interpretation issue under a standard linear model: the model's predictions are not bounded between 0 and 1 in the same manner as our satisfaction variable. Under certain inputs, the linear model predicts values below 0 or above 1, which cannot be read as probabilities of being satisfied rather than neutral/dissatisfied (encoded as 1 and 0 respectively), and key assumptions of linearity and homoskedasticity are violated.
Despite this restriction, linear probability models remain in widespread use, particularly among social scientists, making this a potentially fruitful avenue for a predictive model (Allison, 2015). This largely stems from ease of interpretation and generation; unlike logit (to be discussed later), this directly predicts changes in probability rather than odds ratios, is easier to run, and approximates logit for the 0.2-0.8 probability range in most cases (Allison, 2020). We generated a linear model and used a t-test with robust standard errors to account for violated homoskedasticity assumptions.
##
## Call:
## lm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.076 -0.223 0.005 0.198 1.426
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.31e+00 6.57e-03 -198.79
## Gender -3.81e-04 2.14e-03 -0.18
## Customer.Type 3.57e-01 3.40e-03 105.08
## Age 1.87e-04 7.44e-05 2.51
## Type.of.Travel 4.35e-01 3.08e-03 140.99
## Class 1.25e-01 2.96e-03 42.30
## Flight.Distance 5.88e-06 1.24e-06 4.74
## Pre_Flight_and_WiFi_Ratings 9.07e-02 1.18e-03 76.63
## In_Flight_and_Baggage_Ratings 2.29e-01 1.46e-03 157.09
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.858
## Customer.Type < 2e-16 ***
## Age 0.012 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 2.1e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95695 degrees of freedom
## Multiple R-squared: 0.554, Adjusted R-squared: 0.554
## F-statistic: 1.48e+04 on 8 and 95695 DF, p-value: <2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -1.31e+00 5.69e-03 -229.60
## Gender -3.81e-04 2.14e-03 -0.18
## Customer.Type 3.57e-01 3.89e-03 91.72
## Age 1.87e-04 7.60e-05 2.46
## Type.of.Travel 4.35e-01 3.40e-03 127.92
## Class 1.25e-01 3.36e-03 37.28
## Flight.Distance 5.88e-06 1.22e-06 4.81
## Pre_Flight_and_WiFi_Ratings 9.07e-02 1.26e-03 72.12
## In_Flight_and_Baggage_Ratings 2.29e-01 1.52e-03 150.13
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.859
## Customer.Type < 2e-16 ***
## Age 0.014 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 1.5e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
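To illustrate the two points above (unbounded predictions and robust standard errors), here is a minimal base-R sketch on synthetic data. It assumes the simplest heteroskedasticity-robust estimator (HC0); the estimator actually used for the output above is not stated, so the numbers are not comparable.

```r
set.seed(3)
n <- 2000
x <- runif(n, 1, 5)                 # hypothetical average rating
p <- plogis(-6 + 2 * x)             # assumed true satisfaction probability
y <- rbinom(n, 1, p)                # binary outcome: 1 = satisfied

lpm <- lm(y ~ x)                    # linear probability model

# Fitted "probabilities" are not constrained to [0, 1]
print(range(fitted(lpm)))

# HC0 robust covariance: (X'X)^-1 X' diag(e^2) X (X'X)^-1
X <- model.matrix(lpm)
e <- residuals(lpm)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)            # equals X' diag(e^2) X
robust_vcov <- bread %*% meat %*% bread
robust_se <- sqrt(diag(robust_vcov))

# Coefficients with classical and robust standard errors side by side
print(cbind(estimate = coef(lpm),
            classical_se = sqrt(diag(vcov(lpm))),
            robust_se = robust_se))
```

In practice, packages such as `sandwich` and `lmtest` are commonly used for this; the manual version is shown only to make the formula explicit.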
Based on our linear model, all inputs apart from gender have statistically significant impacts on satisfaction likelihood at the 5% level; age clears that threshold only narrowly (p ≈ 0.01) and has a very small coefficient. As mentioned earlier, one major advantage of the linear model is that coefficients can be easily interpreted. For instance, loyal customers display a 0.357 (35.7 percentage point) increase in predicted satisfaction probability relative to others. In a similar vein, the model predicts a 43.5 percentage point higher satisfaction probability for passengers traveling for business relative to others. For the non-binary aggregated ratings, a 1-point increase corresponds to 9.07 and 22.9 percentage point increases in predicted satisfaction probability for the pre-flight and in-flight groups respectively.
However, to confirm that the linear model is indeed a practically valuable predictor, we can’t rely solely on the dataset used for training; our source provides a second testing dataset for which we can repeat cleaning/encoding steps and apply our model. Since gender is not significant and age is only weakly significant with a negligible coefficient, we elected to remove both prior to this step (marking this as a “v2” model). Using a confusion matrix, we determined that the v2 model’s “accuracy” (the proportion of correctly predicted satisfaction values out of all respondents) is approximately 86.5% for the testing dataset. Based on this information, we can conclude that the linear model is a reasonably good predictor that isn’t overfitting the training data.
##
## Call:
## lm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.076 -0.222 0.005 0.198 1.425
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.30e+00 6.22e-03 -209.08
## Customer.Type 3.59e-01 3.28e-03 109.41
## Type.of.Travel 4.36e-01 3.07e-03 141.88
## Class 1.25e-01 2.95e-03 42.45
## Flight.Distance 5.77e-06 1.24e-06 4.66
## Pre_Flight_and_WiFi_Ratings 9.07e-02 1.18e-03 76.70
## In_Flight_and_Baggage_Ratings 2.29e-01 1.46e-03 157.15
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Customer.Type < 2e-16 ***
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 3.2e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95697 degrees of freedom
## Multiple R-squared: 0.554, Adjusted R-squared: 0.554
## F-statistic: 1.98e+04 on 6 and 95697 DF, p-value: <2e-16
data_test$predicted_probabilities_linear <- predict(linear_model_v2, newdata = data_test)
data_test$predicted_outcome_linear <- ifelse(data_test$predicted_probabilities_linear > 0.5, 1, 0)
confusion_matrix <- table(data_test$satisfaction, data_test$predicted_outcome_linear)
print(confusion_matrix)
##
## 0 1
## 0 11939 1651
## 1 1561 8712
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 3)))
## [1] "Accuracy: 0.865"
However, it is not yet clear that a linear model would be the best predictor available. Logistic regression, which predicts the log odds of satisfaction, is the dominant approach for modeling binary variables (Allison, 2015). Logistic regression models rely on different assumptions than linear models, significantly altering the necessary EDA steps. Unlike a standard linear regression, which assumes that independent variables have a linear relationship with the dependent variable, logistic regression assumes that they have a linear relationship with the log odds. Independence of errors and the absence of strong multicollinearity remain assumptions for both linear and logistic models, whereas homoskedasticity and normally distributed residuals are not required under logistic regression (“Assumptions of Logistic Regression”, n.d.).
Odds represent the number of favorable outcomes divided by the number of unfavorable outcomes. Put differently, if “p” represents the probability of a favorable outcome, Odds = p/(1-p). Log odds take the natural log of the odds, which can be expressed as ln(p/(1-p)) (Agarwal, 2019). We used a visual test to examine whether or not this assumption holds true for continuous variables. While it is not sensible to compute log odds for individual data points, we grouped continuous variables into discrete buckets and calculated the average log odds for each bucket to examine whether or not they might satisfy this assumption.
Only flight distance, as well as in-flight and baggage ratings, displayed roughly linear relationships with log odds of satisfaction in our testing. Age appeared to have a parabolic relationship, peaking in the middle, indicating some sort of aggressive transformation method may be necessary to reach a linear relationship. Meanwhile, log odds for both delay statistics quickly dispersed in both directions as they increase (likely in part due to the limited frequency of higher durations), making it difficult to conclude with certainty that a linear relationship exists. Pre-flight and wi-fi ratings appear to have a significantly looser connection relative to in-flight ratings with a potential dip in log odds for average ratings.
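The bucketed log-odds check described above can be sketched as follows. The data are synthetic and the choice of 10 equal-width buckets is an assumption; the bucket scheme used for the actual plots is not stated.

```r
set.seed(4)
n <- 5000
flight_distance <- runif(n, 100, 4000)     # synthetic predictor
satisfied <- rbinom(n, 1, plogis(-1 + 0.0006 * flight_distance))

# Group the continuous variable into 10 equal-width buckets
bucket <- cut(flight_distance, breaks = 10)

# Share of satisfied passengers and mean predictor value per bucket
p_hat <- tapply(satisfied, bucket, mean)
x_mid <- tapply(flight_distance, bucket, mean)

# Empirical log odds per bucket: ln(p / (1 - p))
log_odds <- log(p_hat / (1 - p_hat))

plot(x_mid, log_odds,
     xlab = "Flight distance (bucket mean)",
     ylab = "Empirical log odds of satisfaction")
abline(lm(log_odds ~ x_mid), lty = 2)      # reference line for judging linearity
```

Buckets with very few observations, or with only satisfied or only unsatisfied passengers, give unstable or infinite log odds, which is one reason sparse ranges such as long delays are hard to read.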
Testing Linearity with log odds
Following visual testing, we generated a logit model in order to examine potential differences relative to the prior linear model. Rather than starting with a pared-down variable list, we returned to an expanded variable list to see if there were any distinctions in what the models deemed statistically significant. This proved to be informative; alongside gender and age, flight distance also failed to reach the threshold for statistical significance.
logit_model = glm(satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel + Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings, data = data, family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.47e+01 9.86e-02 -148.85
## Gender 5.79e-03 2.06e-02 0.28
## Customer.Type 2.48e+00 3.19e-02 77.57
## Age 7.10e-04 7.42e-04 0.96
## Type.of.Travel 3.33e+00 3.24e-02 102.75
## Class 8.32e-01 2.56e-02 32.53
## Flight.Distance 1.45e-05 1.18e-05 1.23
## Pre_Flight_and_WiFi_Ratings 8.30e-01 1.23e-02 67.58
## In_Flight_and_Baggage_Ratings 1.96e+00 1.67e-02 116.80
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Gender 0.78
## Customer.Type <2e-16 ***
## Age 0.34
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Flight.Distance 0.22
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61179 on 95695 degrees of freedom
## AIC: 61197
##
## Number of Fisher Scoring iterations: 6
In order to compare this with the linear model, we generated another confusion matrix based on the testing data. In a similar fashion to the linear model, we created a “v2” model removing statistically insignificant inputs. The accuracy results were better than those of the linear model, but only slightly; it isn’t clear whether this marginal improvement would hold true given further testing with different survey data. The calculated McFadden pseudo-R^2 falls above 0.5.
##
## Call:
## glm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.6554 0.0965 -151.9
## Customer.Type 2.4940 0.0298 83.7
## Type.of.Travel 3.3316 0.0321 103.8
## Class 0.8450 0.0236 35.9
## Pre_Flight_and_WiFi_Ratings 0.8302 0.0123 67.6
## In_Flight_and_Baggage_Ratings 1.9558 0.0167 116.9
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Customer.Type <2e-16 ***
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61181 on 95698 degrees of freedom
## AIC: 61193
##
## Number of Fisher Scoring iterations: 6
##
## 0 1
## 0 12635 955
## 1 2189 8084
## [1] "Accuracy: 0.868"
## [1] "McFadden R^2: 0.531"
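As a quick check, the McFadden value can be reproduced from the deviances in the v2 summary. This sketch assumes the usual definition, 1 minus the ratio of model to null log-likelihood, which for ungrouped binary data reduces to a ratio of deviances.

```r
# Deviances copied from the v2 logit summary above
null_deviance     <- 130562
residual_deviance <- 61181

# For ungrouped binary outcomes the saturated log-likelihood is 0, so
# deviance = -2 * logLik and McFadden's R^2 = 1 - residual / null deviance
mcfadden_r2 <- 1 - residual_deviance / null_deviance
print(round(mcfadden_r2, 3))
```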
Logistic Regression
Logistic regression is generally preferable to linear regression for binary or categorical outcomes, as it models probabilities bounded between 0 and 1. Its S-shaped link captures the non-linear relationship between predictors and probability, and it provides odds ratios, making it suitable for risk assessment in fields like medicine. Logistic regression also does not require homoskedastic or normally distributed errors, unlike linear regression, which assumes a continuous outcome that is linearly related to the predictors.
##
## Call:
## glm(formula = satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
## Ease.of.Online.booking + Online.boarding + Seat.comfort +
## Inflight.entertainment + On.board.service + Leg.room.service +
## Baggage.handling + Checkin.service + Inflight.service + Cleanliness +
## Arrival.Delay.in.Minutes, family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.34e+01 9.93e-02 -134.91
## Age 1.20e-02 7.83e-04 15.33
## Type.of.Travel 2.35e+00 3.28e-02 71.69
## Class 1.26e+00 2.66e-02 47.32
## Inflight.wifi.service 6.25e-01 1.29e-02 48.35
## Ease.of.Online.booking -4.75e-02 1.11e-02 -4.27
## Online.boarding 1.03e+00 1.22e-02 84.35
## Seat.comfort 2.83e-02 1.26e-02 2.25
## Inflight.entertainment 3.12e-01 1.48e-02 21.16
## On.board.service 3.06e-01 1.14e-02 26.74
## Leg.room.service 3.62e-01 9.86e-03 36.74
## Baggage.handling 5.58e-02 1.27e-02 4.40
## Checkin.service 2.49e-01 9.39e-03 26.54
## Inflight.service 1.88e-02 1.34e-02 1.41
## Cleanliness 1.10e-01 1.27e-02 8.67
## Arrival.Delay.in.Minutes -3.87e-03 2.83e-04 -13.68
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Age < 2e-16 ***
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Inflight.wifi.service < 2e-16 ***
## Ease.of.Online.booking 1.9e-05 ***
## Online.boarding < 2e-16 ***
## Seat.comfort 0.024 *
## Inflight.entertainment < 2e-16 ***
## On.board.service < 2e-16 ***
## Leg.room.service < 2e-16 ***
## Baggage.handling 1.1e-05 ***
## Checkin.service < 2e-16 ***
## Inflight.service 0.159
## Cleanliness < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 54781 on 95688 degrees of freedom
## AIC: 54813
##
## Number of Fisher Scoring iterations: 6
Observations
Statistical Significance: Most variables have p-values less than 0.05, indicating they significantly influence the dependent variable.
Positive and Negative Relationships: Positive coefficients (e.g., ‘Age’, ‘Type of Travel’) suggest a positive relationship with the outcome, whereas negative coefficients (e.g., ‘Ease of Online booking’, ‘Arrival Delay in Minutes’) indicate a negative relationship.
High Impact Factors: Variables with larger coefficients and small standard errors, like ‘Online boarding’ and ‘Type of Travel’, may have a more substantial impact on the outcome.
Model Fit: The large difference between the null and residual deviance suggests a good model fit.
Non-significant Variables: Some variables, like ‘Inflight service’, do not show statistical significance, implying a weaker or no influence on the dependent variable.
Good Predictive Ability: The model seems capable of predicting the outcome effectively, given the significance and size of most coefficients.
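Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below uses a few estimates copied from the summary above; the selection of variables is arbitrary.

```r
# Selected log-odds coefficients copied from the summary above
coefs <- c(Online.boarding = 1.03,
           Type.of.Travel = 2.35,
           Ease.of.Online.booking = -0.0475,
           Arrival.Delay.in.Minutes = -0.00387)

# exp(coefficient) is the multiplicative change in the odds of satisfaction
# for a one-unit increase in the predictor, holding the others fixed
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

For example, a one-point higher online boarding rating multiplies the odds of satisfaction by roughly 2.8, while each extra minute of arrival delay multiplies them by slightly less than 1.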
## log_pred_class
## 0 1
## 0 12151 1439
## 1 1445 8828
The model’s performance was evaluated using various metrics. The results are as follows:
- Accuracy: 0.879 (The proportion of true results among the total number of cases)
- Precision: 0.86 (The proportion of true positives among all positive predictions)
- Recall: 0.859 (The proportion of true positives among all actual positives)
- F1 Score: 0.86 (The harmonic mean of precision and recall)
- Specificity: 0.894 (The proportion of true negatives among all actual negatives)
Observations
Model Accuracy (0.879): The accuracy is quite high, at 87.9%. This means that the model correctly predicts whether a customer is satisfied or not in approximately 88 out of 100 cases. It’s a good indicator of overall performance, but it’s important to consider other metrics as well, especially if the data set is imbalanced.
Precision (0.86) and Recall (0.859): Both precision and recall are also high, around 86%. Precision indicates that when the model predicts customer satisfaction, it is correct 86% of the time. Recall tells us that the model successfully identifies 85.9% of actual satisfied customers. These metrics are particularly important in scenarios where the costs of false positives and false negatives are different.
F-Measure (0.86): The F-Measure, which balances precision and recall, is also 0.86. This suggests a good balance between precision and recall in the model, which is crucial for a well-rounded predictive performance.
Specificity (0.894): The specificity is 89.4%, indicating that the model is quite good at identifying true negatives - i.e., it correctly identifies customers who are not satisfied.
Area Under the Curve (AUC) of ROC (0.948): The AUC value is 0.948, which is very close to 1. This high value indicates that the model has an excellent ability to discriminate between satisfied and unsatisfied customers. It implies that the model has a high true positive rate and a low false positive rate.
Overall, the model exhibits strong predictive capabilities across various metrics, indicating that it is well-tuned for this particular task. However, it’s always important to consider the context and the potential impact of misclassifications. Examining other aspects, such as model interpretability, feature importance, and performance on different segments of the data, can also provide deeper insights.
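The metrics discussed above can be recomputed directly from the confusion matrix. The sketch assumes rows are actual classes and columns are predicted classes, which is consistent with the reported specificity.

```r
# Confusion matrix copied from the output above
# (assumption: rows = actual class, columns = predicted class)
cm <- matrix(c(12151, 1439,
                1445, 8828),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

TN <- cm["0", "0"]; FP <- cm["0", "1"]
FN <- cm["1", "0"]; TP <- cm["1", "1"]

accuracy    <- (TP + TN) / sum(cm)
precision   <- TP / (TP + FP)       # correct among predicted satisfied
recall      <- TP / (TP + FN)       # found among actually satisfied
specificity <- TN / (TN + FP)       # found among actually not satisfied
f1          <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              f1 = f1, specificity = specificity), 3))
```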
## Age Type.of.Travel
## 1.06 1.42
## Class Inflight.wifi.service
## 1.47 1.85
## Ease.of.Online.booking Online.boarding
## 1.68 1.27
## Seat.comfort Inflight.entertainment
## 1.85 2.41
## On.board.service Leg.room.service
## 1.58 1.18
## Baggage.handling Checkin.service
## 1.71 1.16
## Inflight.service Cleanliness
## 1.87 2.02
## Arrival.Delay.in.Minutes
## 1.02
Observations for vif
The results are generally good: all VIF values fall between 1.02 and 2.41, indicating that multicollinearity is not a significant issue for the model’s predictors.
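For intuition, the textbook VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. The sketch below computes this by hand on synthetic predictors (the variable names are hypothetical); packaged VIF functions applied to a fitted logistic model may use a slightly different computation.

```r
set.seed(5)
n <- 1000
seat    <- rnorm(n)
clean   <- 0.7 * seat + rnorm(n, sd = 0.7)  # deliberately correlated with seat
legroom <- rnorm(n)                          # independent of the others
X <- data.frame(seat, clean, legroom)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on all remaining predictors
vif_manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Values near 1 indicate little shared variance; the correlated pair here lands near 2, comparable to the largest values in the output above.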
## Area under the curve: 0.948
Observations for ROC:
The ROC curve displayed is highly indicative of an excellent predictive model, with an AUC (Area Under the Curve) of 0.95, showing exceptional discrimination ability between the positive and negative classes. The curve stays well above the diagonal line of no-discrimination, signaling strong performance.
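The AUC has a direct interpretation: it is the probability that a randomly chosen satisfied passenger receives a higher predicted score than a randomly chosen unsatisfied one. A base-R sketch on synthetic scores, using the rank (Mann-Whitney) formula rather than any particular ROC package:

```r
set.seed(6)
n <- 1000
actual <- rbinom(n, 1, 0.45)
# Synthetic predicted probabilities: higher on average for actual positives
score <- plogis(rnorm(n, mean = ifelse(actual == 1, 1.5, -1.5)))

# AUC = Mann-Whitney U statistic divided by n_pos * n_neg
r <- rank(score)
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(round(auc, 3))

# Same quantity by brute force: share of (positive, negative) pairs ranked correctly
auc_pairs <- mean(outer(score[actual == 1], score[actual == 0], ">"))
```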
Decision Tree
Data Preparation
Data Type Conversion
- Certain columns in both training and testing datasets are converted to factors to reflect their ordinal nature.
Column Datatype Changes - Testing Data: Conversion of certain columns to factors based on their ordinal nature.
data_test$Inflight.wifi.service = as.factor(data_test$Inflight.wifi.service)
data_test$Departure.Arrival.time.convenient = as.factor(data_test$Departure.Arrival.time.convenient)
data_test$Ease.of.Online.booking = as.factor(data_test$Ease.of.Online.booking)
data_test$Gate.location = as.factor(data_test$Gate.location)
data_test$Food.and.drink = as.factor(data_test$Food.and.drink)
data_test$Online.boarding = as.factor(data_test$Online.boarding)
data_test$Seat.comfort = as.factor(data_test$Seat.comfort)
data_test$Inflight.entertainment = as.factor(data_test$Inflight.entertainment)
data_test$On.board.service = as.factor(data_test$On.board.service)
data_test$Leg.room.service = as.factor(data_test$Leg.room.service)
data_test$Baggage.handling = as.factor(data_test$Baggage.handling)
data_test$Checkin.service = as.factor(data_test$Checkin.service)
data_test$Inflight.service = as.factor(data_test$Inflight.service)
data_test$Cleanliness = as.factor(data_test$Cleanliness)
Column Datatype Changes - Training Data: Similar data type conversions for training data.
# Column datatype changes - training data: the rating columns are ordinal, so convert them to factors
data$Inflight.wifi.service = as.factor(data$Inflight.wifi.service)
data$Departure.Arrival.time.convenient = as.factor(data$Departure.Arrival.time.convenient)
data$Ease.of.Online.booking = as.factor(data$Ease.of.Online.booking)
data$Gate.location = as.factor(data$Gate.location)
data$Food.and.drink = as.factor(data$Food.and.drink)
data$Online.boarding = as.factor(data$Online.boarding)
data$Seat.comfort = as.factor(data$Seat.comfort)
data$Inflight.entertainment = as.factor(data$Inflight.entertainment)
data$On.board.service = as.factor(data$On.board.service)
data$Leg.room.service = as.factor(data$Leg.room.service)
data$Baggage.handling = as.factor(data$Baggage.handling)
data$Checkin.service = as.factor(data$Checkin.service)
data$Inflight.service = as.factor(data$Inflight.service)
data$Cleanliness = as.factor(data$Cleanliness)
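The same conversion can be written more compactly by listing the rating columns once and applying `as.factor` over them. This is a sketch on small synthetic data frames; `convert_ratings`, `toy_train`, and `toy_test` are hypothetical names.

```r
# Rating columns named in the conversion code above
rating_cols <- c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                 "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
                 "Online.boarding", "Seat.comfort", "Inflight.entertainment",
                 "On.board.service", "Leg.room.service", "Baggage.handling",
                 "Checkin.service", "Inflight.service", "Cleanliness")

# Small synthetic stand-ins for the training and testing data frames
set.seed(10)
make_toy <- function(n) {
  as.data.frame(matrix(sample(1:5, n * length(rating_cols), replace = TRUE),
                       nrow = n, dimnames = list(NULL, rating_cols)))
}
toy_train <- make_toy(50)
toy_test  <- make_toy(20)

# Hypothetical helper: one definition, applied to both data sets
convert_ratings <- function(df, cols) {
  df[cols] <- lapply(df[cols], as.factor)
  df
}
toy_train <- convert_ratings(toy_train, rating_cols)
toy_test  <- convert_ratings(toy_test, rating_cols)

str(toy_train[, 1:3])
```

Note that `as.factor` creates unordered factors; if the ordering of the ratings should be preserved, `factor(x, ordered = TRUE)` is one alternative.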
Decision Tree Model Building
Initial Model Building: A decision tree (`tree`) is constructed using various predictors such as customer demographics, service ratings, and flight details.
Variable Importance Analysis: The importance of each variable in the decision tree is evaluated to identify significant predictors.
This analysis helps in understanding which variables (predictors) are most influential in determining the target variable, here the satisfaction of airline passengers.
- The class of travel and type of travel are the most influential factors in determining passenger satisfaction, indicating the importance of service level and travel purpose.
- Online and inflight services (boarding, entertainment, wifi) are also crucial, emphasizing the importance of digital experience and onboard comfort.
- Personal factors like Age have some influence but are overshadowed by service and experience-related factors.
- Several variables have no discernible impact on satisfaction in this model, suggesting that they might not be critical in the context of this specific dataset or the way the model was constructed.
This analysis provides valuable insights into what factors airlines should focus on to improve passenger satisfaction, particularly emphasizing service quality, both digital and onboard.
## Overall
## Age 168
## Arrival.Delay.in.Minutes 131
## Class 17608
## Ease.of.Online.booking 1671
## Inflight.entertainment 13115
## Inflight.wifi.service 12888
## Leg.room.service 4009
## On.board.service 2179
## Online.boarding 16997
## Type.of.Travel 17087
## Gender 0
## Customer.Type 0
## Flight.Distance 0
## Departure.Arrival.time.convenient 0
## Gate.location 0
## Food.and.drink 0
## Seat.comfort 0
## Baggage.handling 0
## Checkin.service 0
## Inflight.service 0
## Cleanliness 0
## Departure.Delay.in.Minutes 0
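A minimal sketch of how variable importance can be read from a fitted tree, assuming the `rpart` package and synthetic data. The importance table above may have been produced by a different function, so scales are not comparable.

```r
library(rpart)

set.seed(7)
n <- 2000
toy <- data.frame(
  Online.boarding       = sample(1:5, n, replace = TRUE),
  Inflight.wifi.service = sample(1:5, n, replace = TRUE),
  Gate.location         = sample(1:5, n, replace = TRUE)  # pure noise by construction
)
toy$satisfaction <- factor(rbinom(
  n, 1, plogis(-6 + 1.5 * toy$Online.boarding + 0.5 * toy$Inflight.wifi.service)
))

# method = "class" requests a classification tree for the binary outcome
fit <- rpart(satisfaction ~ ., data = toy, method = "class")

# Total goodness-of-split credited to each variable (surrogate splits included);
# variables never used as a primary or surrogate split are omitted
print(fit$variable.importance)
```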
Refined Model: A second decision tree (`tree1`) is built focusing only on the significant variables identified earlier.
Decision Tree Visualization: The structure of the refined decision tree is visualized using `prp`.
The decision tree shows a simplified model of how different factors contribute to the outcome of passenger satisfaction, which is categorized as either satisfied or neutral/dissatisfied.
Interpretation and Implications:
- Online Boarding is a significant determinant of initial satisfaction. A better online boarding experience leads directly to a higher chance of satisfaction, bypassing other factors.
- Inflight Entertainment is the second most crucial factor; however, its impact is nuanced by the previous experience with online boarding.
- Type of Travel being personal indicates a more significant expectation or reliance on Inflight Entertainment for satisfaction.
- It’s worth noting that the tree uses a binary split between satisfied and neutral/dissatisfied, because neutral and dissatisfied responses are grouped into a single class (encoded as 0) in this dataset.
Based on this tree, to improve overall passenger satisfaction, an airline should focus on enhancing the online boarding process and the quality of inflight entertainment, especially for those traveling for personal reasons.
The tree simplifies the prediction of satisfaction and does not account for all the nuances or interactions between different factors but provides a quick and interpretable way to understand key drivers of satisfaction.
Model Tuning and Evaluation
Cross-Validation Setup: A 10-fold cross-validation is defined for tuning the complexity parameter (`cp`).
Cross-Validation Execution: The model is trained across a range of `cp` values to find the optimal model.
## CART
##
## 95704 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 86133, 86134, 86134, 86133, 86133, 86134, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01 0.277 0.686 0.153
## 0.02 0.304 0.622 0.185
## 0.03 0.312 0.601 0.195
## 0.04 0.312 0.601 0.195
## 0.05 0.312 0.601 0.195
## 0.06 0.334 0.544 0.223
## 0.07 0.334 0.544 0.223
## 0.08 0.334 0.544 0.223
## 0.09 0.389 0.380 0.303
## 0.10 0.389 0.380 0.303
## 0.11 0.389 0.380 0.303
## 0.12 0.397 0.354 0.316
## 0.13 0.426 0.259 0.362
## 0.14 0.426 0.259 0.362
## 0.15 0.426 0.259 0.362
## 0.16 0.426 0.259 0.362
## 0.17 0.426 0.259 0.362
## 0.18 0.426 0.259 0.362
## 0.19 0.426 0.259 0.362
## 0.20 0.426 0.259 0.362
## 0.21 0.426 0.259 0.362
## 0.22 0.426 0.259 0.362
## 0.23 0.426 0.259 0.362
## 0.24 0.426 0.259 0.362
## 0.25 0.426 0.259 0.362
## 0.26 0.494 NaN 0.489
## 0.27 0.494 NaN 0.489
## 0.28 0.494 NaN 0.489
## 0.29 0.494 NaN 0.489
## 0.30 0.494 NaN 0.489
## 0.31 0.494 NaN 0.489
## 0.32 0.494 NaN 0.489
## 0.33 0.494 NaN 0.489
## 0.34 0.494 NaN 0.489
## 0.35 0.494 NaN 0.489
## 0.36 0.494 NaN 0.489
## 0.37 0.494 NaN 0.489
## 0.38 0.494 NaN 0.489
## 0.39 0.494 NaN 0.489
## 0.40 0.494 NaN 0.489
## 0.41 0.494 NaN 0.489
## 0.42 0.494 NaN 0.489
## 0.43 0.494 NaN 0.489
## 0.44 0.494 NaN 0.489
## 0.45 0.494 NaN 0.489
## 0.46 0.494 NaN 0.489
## 0.47 0.494 NaN 0.489
## 0.48 0.494 NaN 0.489
## 0.49 0.494 NaN 0.489
## 0.50 0.494 NaN 0.489
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was cp = 0.01.
Complexity Parameter (cp): The complexity parameter is a measure of the cost of adding additional splits to the tree. A smaller `cp` value allows for more splits (i.e., a more complex tree), whereas a larger `cp` value results in fewer splits (i.e., a simpler tree). The tuning process tested `cp` values from 0.01 up to 0.50.
Model Performance Metrics: The performance of the model at each `cp` value is evaluated using three metrics:
- RMSE (Root Mean Squared Error): This measures the standard deviation of the prediction errors or residuals. Lower values are better as they indicate less deviation between the predicted and actual values.
- Rsquared: This is the coefficient of determination, indicating the proportion of the variance in the dependent variable that’s predictable from the independent variables. Higher values (close to 1) are better.
- MAE (Mean Absolute Error): This measures the average magnitude of the errors in a set of predictions, without considering their direction. Lower values are better.
Optimal Model: According to the summary, the optimal model was chosen with a `cp` value of 0.01. This model has the smallest RMSE (0.277), the highest Rsquared (0.686), and the lowest MAE (0.153), suggesting that it has the best predictive performance among the models tested.
Model Underfitting Concerns: As `cp` increases, the RMSE and MAE tend to increase while Rsquared decreases, which may indicate that the model becomes too simple and starts to underfit the data. The optimal `cp` of 0.01 suggests that a more complex model performs better on this dataset.
Degradation of Model Performance: Beyond a `cp` value of 0.25, Rsquared values are not available (NaN). RMSE and MAE also stop changing at that point, which suggests the tree has been pruned back to a single node that predicts the same value for every passenger, leaving no variation in the predictions from which to compute Rsquared.
Data and Resampling: The model was trained on a large sample of 95,704 instances and 15 predictors. The use of 10-fold cross-validation helps to ensure that the evaluation of the model’s performance is robust and not overly dependent on a particular split of the data.
In summary, the CART model performs best with a complexity parameter of 0.01, indicating that a model with more splits (thus more complexity) is better suited to this dataset. This model shows a good balance between bias and variance, with a relatively low prediction error and a decent explanation of variance, as per the given performance metrics.
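The effect of `cp` on tree size can be seen by growing one large tree and pruning it at several values. This sketch assumes the `rpart` package and synthetic data, and it does not reproduce the cross-validation setup summarized above.

```r
library(rpart)

set.seed(8)
n <- 3000
toy <- data.frame(a = runif(n), b = runif(n), c = runif(n))
toy$y <- factor(rbinom(n, 1, plogis(-3 + 4 * toy$a + 2 * toy$b)))

# Grow a deliberately large tree, then prune it back at several cp values
full <- rpart(y ~ ., data = toy, method = "class",
              control = rpart.control(cp = 0.0005, xval = 10))

count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
cps <- c(0.0005, 0.01, 0.05, 0.5)
leaves <- sapply(cps, function(cp) count_leaves(prune(full, cp = cp)))
print(data.frame(cp = cps, leaves = leaves))

# Cross-validated error (xerror) for each cp considered while growing the tree
printcp(full)
```

The tuning output above reports RMSE, Rsquared, and MAE, which are regression metrics; fitting the outcome as a factor, as in this sketch, yields a classification tree instead.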
Model Performance Analysis
- ROC Curve Plotting: The Receiver Operating Characteristic (ROC) curve is plotted to evaluate the model’s true positive rate vs. false positive rate.
- Confusion Matrix and Accuracy: The confusion matrix is used to calculate the model’s accuracy at an optimal threshold identified from the ROC curve.
##
## FALSE TRUE
## 0 12139 1451
## 1 1720 8553
- Performance Metrics Calculation: Key metrics including Accuracy, Sensitivity (Recall), Precision, F-Measure, and Specificity are calculated.
The model’s performance was evaluated using various metrics. The results are as follows:
- Accuracy: 0.867 (The proportion of true results among the total number of cases)
- Precision: 0.855 (The proportion of true positives among all positive predictions)
- Recall: 0.833 (The proportion of true positives among all actual positives)
- F1 Score: 0.844 (The harmonic mean of precision and recall)
- Specificity: 0.893 (The proportion of true negatives among all actual negatives)
- AUC-ROC Value: The Area Under the Curve (AUC) for the ROC is computed, providing a single measure of the model’s overall performance.
# Testing data AUC-ROC (Area Under the Curve - Receiver Operating Characteristic) value
AUC = as.numeric(performance(pred, "auc")@y.values)
- AUC-ROC Value: 0.896
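One common way to pick an "optimal" threshold from a ROC curve is to maximize Youden's J (true positive rate minus false positive rate). The rule actually used for the confusion matrix above is not stated, so the sketch below, on synthetic scores, is only an assumption about how such a threshold could be chosen.

```r
set.seed(9)
n <- 2000
actual <- rbinom(n, 1, 0.43)
# Synthetic predicted probabilities: higher on average for actual positives
score <- plogis(rnorm(n, mean = ifelse(actual == 1, 1, -1)))

thresholds <- seq(0.05, 0.95, by = 0.01)
tpr <- sapply(thresholds, function(t) mean(score[actual == 1] > t))  # sensitivity
fpr <- sapply(thresholds, function(t) mean(score[actual == 0] > t))  # 1 - specificity

# Youden's J = TPR - FPR; pick the threshold that maximizes it
best <- thresholds[which.max(tpr - fpr)]
print(best)

# Confusion matrix at the chosen threshold (columns FALSE/TRUE)
predicted <- score > best
cm <- table(actual, predicted)
print(cm)
```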
Random Forest Model
## $accuracy
## [1] 0.0527
##
## $precision
## [1] 0
##
## $recall
## [1] 0
##
## $f_measure
## [1] NaN
##
## $specificity
## [1] 0.0925
##
## $AUC
## Area under the curve: 0.976
Observations:
The output indicates high effectiveness of the model in predicting customer satisfaction. The accuracy of 95.7% shows that the model correctly predicts satisfaction in most cases. High precision (92.9%) and recall (96.7%) suggest the model is reliable in identifying satisfied customers and minimizing false positives. The F-measure of 94.8% and specificity of 97.7% further confirm the model’s robustness. The AUC value of 95.3% indicates excellent model performance in distinguishing between satisfied and unsatisfied customers.
Observations:
The curve plots the true positive rate (Sensitivity) against the false positive rate (1 - Specificity) and demonstrates a steep ascent toward the upper-left corner, indicative of a high true positive rate and low false positive rate. This steepness suggests that the model has a strong performance in distinguishing between classes. The curve’s significant elevation above the diagonal dashed line—which represents a random guess—further underscores the model’s effective discriminative power. Although the Area Under the Curve (AUC) value is not shown, the shape of the ROC curve implies a high AUC, signifying that the classifier performs much better than chance in predicting the positive class.
Conclusion
Citations
Klein, TJ (2020). Airline Passenger Satisfaction. Kaggle. https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction?select=train.csv
Lutz, A., & Lubin, G. (2012). Airlines Have An Insanely Small Profit Margin. Business Insider. https://www.businessinsider.com/airlines-have-a-small-profit-margin-2012-6
Hardee, H. (2023). Frontier reports lacklustre Q3 results as it struggles in ‘over-saturated’ core markets. FlightGlobal. https://www.flightglobal.com/strategy/frontier-reports-lacklustre-q3-results-as-it-struggles-in-over-saturated-core-markets/155561.article
vif: Variance Inflation Factors. (n.d.). R Package Documentation. https://rdrr.io/cran/car/man/vif.html
Allison, P. (2015, April 1). What’s So Special About Logit?. Statistical Horizons. https://statisticalhorizons.com/whats-so-special-about-logit/
Assumptions of Logistic Regression. (n.d.). Statistics Solutions. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/assumptions-of-logistic-regression/
Agarwal, P. (2019, July 8). WHAT and WHY of Log Odds. Towards Data Science. https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704